Abstract

We present a probabilistic viewpoint on multiple kernel learning, unifying well-known
regularised risk approaches and recent advances in approximate Bayesian inference relaxations.
The framework proposes a general objective function suitable for regression, robust
regression and classification that is a lower bound of the marginal likelihood and contains many
regularised risk approaches as special cases. Furthermore, we derive an efficient and
provably convergent optimisation algorithm.

Keywords: Multiple kernel learning, approximate Bayesian inference, double loop
algorithms, Gaussian processes

1. Introduction

Nonparametric kernel methods, cornerstones of machine learning today, can be seen from
different angles: as regularised risk minimisation in function spaces (Schölkopf and Smola,
2002), or as probabilistic Gaussian process methods (Rasmussen and Williams, 2006). In
these techniques, the kernel (or equivalently covariance) function encodes interpolation
characteristics from observed to unseen points, and two basic statistical problems have to be
mastered. First, a latent function must be predicted which fits data well, yet is as smooth
as possible given the fixed kernel. Second, the kernel function parameters have to be learned
as well, to best support predictions which are of primary interest. While the first problem is
simpler and has been addressed much more frequently so far, the central role of learning the
covariance function is well acknowledged, and a substantial number of methods for "learning
the kernel", "multiple kernel learning", or "evidence maximisation" are available now.
However, much of this work has firmly been associated with one of the "camps" (referred
to as regularised risk and probabilistic in the sequel) with surprisingly little crosstalk or
acknowledgment of prior work across this boundary. In this paper, we clarify the relationship
between major regularised risk and probabilistic kernel learning techniques precisely,
pointing out advantages and pitfalls of either, as well as algorithmic similarities leading to
novel powerful algorithms.

We develop a common analytical and algorithmic framework encompassing approaches
from both camps and provide clear insights into the optimisation structure. Even though
most of the optimisation is non-convex, we show how to operate a provably convergent
"almost Newton" method nevertheless. Each step is not much more expensive than a
gradient-based approach. Also, we do not require any foreign optimisation code to be available.
Our framework unifies kernel learning for regression, robust regression and classification.

The paper is structured as follows: In section 2, we introduce the regularised risk and the 
probabilistic view of kernel learning. In increasing order of generality, we explain multiple 
kernel learning (MKL, section 2.1), maximum a posteriori estimation (MAP, section 2.2) 
and marginal likelihood maximisation (MLM, section 2.3). A taxonomy of the mutual 
relations between the approaches and important special cases is given in section 2.4. Section 
3 introduces a general optimisation scheme and section 4 draws a conclusion. 

2. Kernel Methods and Kernel Learning 

Kernel-based algorithms come in many shapes; however, the primary goal is, based on
training data $\{(x_i, y_i) \mid i = 1..n\}$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$ and a
parametrised kernel function $k_\theta(x, x')$, to predict the output for unseen inputs $x_*$.
Often, linear parametrisations $k_\theta(x, x') = \sum_{m=1}^M \theta_m k_m(x, x')$ are used,
where the $k_m$ are fixed positive definite functions, and $\theta \succeq 0$. Learning the
kernel means finding $\theta$ to best support this goal. In general, kernel methods employ a
postulated latent function $u : \mathcal{X} \to \mathbb{R}$ whose smoothness is controlled via
the function space squared norm $\|u(\cdot)\|^2_{k_\theta}$. Most often, smoothness is traded
against data fit, either enforced by a loss function $\ell(y_i, u(x_i))$ or modelled by a
likelihood $P(y_i | u_i)$. Let us define kernel matrices $K_\theta := [k_\theta(x_i, x_j)]_{ij}$
and $K_m := [k_m(x_i, x_j)]_{ij}$ in $\mathbb{R}^{n \times n}$ and the vectors
$y := [y_i]_i \in \mathcal{Y}^n$, $u := [u(x_i)]_i \in \mathbb{R}^n$ collecting outputs and latent
function values, respectively.

The regularised risk route to kernel prediction, which is followed by any support vector
machine (SVM) or ridge regression technique, yields
$\|u(\cdot)\|^2_{k_\theta} + \frac{C}{n} \sum_{i=1}^n \ell(y_i, u(x_i))$ as criterion,
enforcing smoothness of $u(\cdot)$ as well as good data fit through the loss function
$\frac{C}{n} \ell(y_i, u(x_i))$. By the representer theorem, the minimiser can be written as
$u(\cdot) = \sum_i \alpha_i k_\theta(\cdot, x_i)$, so that
$\|u(\cdot)\|^2_{k_\theta} = \alpha^\top K_\theta \alpha$ (Schölkopf and Smola, 2002). As
$u = K_\theta \alpha$, the regularised risk problem is given by

$\min_u \; u^\top K_\theta^{-1} u + \frac{C}{n} \sum_{i=1}^n \ell(y_i, u_i)$. (1)

A probabilistic viewpoint of the same setting is based on the notion of a Gaussian process
(GP) (Rasmussen and Williams, 2006): a Gaussian random function $u(\cdot)$ with mean
function $E[u(x)] = m(x) \equiv 0$ and covariance function
$V[u(x), u(x')] = E[u(x) u(x')] = k_\theta(x, x')$.
In practice, we only use finite-dimensional snapshots of the process $u(\cdot)$: for example,
$P(u; \theta) = \mathcal{N}(u | 0, K_\theta)$, a zero-mean joint Gaussian with covariance matrix
$K_\theta$. We adopt this GP as prior distribution over $u(\cdot)$, estimating the latent
function as maximum of the posterior process
$P(u(\cdot) | y; \theta) \propto P(y | u) P(u(\cdot); \theta)$. Since the likelihood depends on
$u(\cdot)$ only through the finite subset $\{u(x_i)\}$, the posterior process has a
finite-dimensional representation: $P(u(\cdot) | y, u) = P(u(\cdot) | u)$, so that
$P(u(\cdot) | y; \theta) = \int P(u(\cdot) | u) P(u | y; \theta) \, du$ is specified
by the joint distribution $P(u | y; \theta)$, a probabilistic equivalent of the representer
theorem. Kernel prediction amounts to maximum a posteriori (MAP) estimation

$\max_u P(u | y; \theta) \propto \max_u P(u; \theta) P(y | u) \equiv \min_u \; u^\top K_\theta^{-1} u - 2 \ln P(y | u) + \ln |K_\theta|$, (2)

ignoring an additive constant. Minimising equations (1) and (2) for any fixed kernel matrix
$K_\theta$ gives the same minimiser $\hat{u}$ and prediction
$u(x_*) = \hat{u}^\top K_\theta^{-1} [k_\theta(x_i, x_*)]_i$.
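As a minimal sketch of the shared minimiser of equations (1) and (2), the snippet below uses a Gaussian likelihood, for which the minimiser has the closed form $\hat{u} = K_\theta (K_\theta + \sigma^2 I)^{-1} y$. The squared-exponential kernel, the toy data, the jitter and the helper name `rbf_kernel` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=0.5):
    """Squared-exponential kernel matrix between two sets of 1-d inputs (assumed kernel)."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
sigma2 = 0.1  # noise variance; plays the role of n/C in equation (1)

K = rbf_kernel(x, x) + 1e-6 * np.eye(x.size)  # jitter for numerical stability

# Minimiser of u^T K^{-1} u + ||y - u||^2 / sigma2
u_hat = K @ np.linalg.solve(K + sigma2 * np.eye(x.size), y)

# Stationarity check: the gradient 2 K^{-1} u - 2 (y - u) / sigma2 vanishes at u_hat
grad = 2.0 * np.linalg.solve(K, u_hat) - 2.0 * (y - u_hat) / sigma2

# Prediction u(x_*) = u_hat^T K^{-1} [k(x_i, x_*)]_i at one unseen input
x_star = np.array([2.5])
u_star = rbf_kernel(x_star, x) @ np.linalg.solve(K, u_hat)
```

The additional $\ln |K_\theta|$ term of equation (2) does not depend on $u$, which is why both criteria share $\hat{u}$ for fixed $K_\theta$.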
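$\mathcal{Y}$ | Loss function            | $\ell(y_i, u_i)$                               | $P(y_i \mid u_i)$                      | Likelihood
--------------+--------------------------+------------------------------------------------+----------------------------------------+-----------
$\{\pm 1\}$   | SVM hinge loss           | $\max(0, 1 - y_i u_i)$                         | $\nexists$                             | $\nexists$
$\{\pm 1\}$   | Log loss                 | $\ln(\exp(-y_i u_i) + 1)$                      | $1 / (\exp(-\tau y_i u_i) + 1)$        | Logistic
$\mathbb{R}$  | SVM ε-insensitive loss   | $\max(0, \lvert y_i - u_i \rvert / \epsilon - 1)$ | $\nexists$                          | $\nexists$
$\mathbb{R}$  | Quadratic loss           | $(y_i - u_i)^2$                                | $\mathcal{N}(y_i \mid u_i, \sigma^2)$  | Gaussian
$\mathbb{R}$  | Linear loss              | $\lvert y_i - u_i \rvert$                      | $\mathcal{L}(y_i \mid u_i, \tau)$      | Laplace

Table 1: Relations between loss functions and likelihoods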



The correspondence between likelihood and loss function bridges probabilistic and
regularised risk techniques. More specifically, any likelihood $P(y | u)$ induces a loss function
$\ell(y, u)$ via

$-2 \ln P(y | u) = -2 \sum_{i=1}^n \ln P(y_i | u_i) \;\hat{=}\; \frac{C}{n} \sum_{i=1}^n \ell(y_i, u_i) = \ell(y, u)$,

however, some loss functions cannot be interpreted as a negative log likelihood, as shown
in table 1 and as discussed for the SVM by Sollich (2000). If the likelihood is a
log-concave function of $u$, it corresponds to a convex loss function (Boyd and Vandenberghe,
2002, Sect. 3.5.1). Common loss functions and likelihoods for classification
$\mathcal{Y} = \{\pm 1\}$ and regression $\mathcal{Y} = \mathbb{R}$ are listed in table 1.
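The correspondence can be checked numerically for the logistic row of table 1. The sketch below assumes $\tau = 1$ and $C/n = 2$, so that $-2 \ln P(y|u)$ equals twice the log loss; it also illustrates, under the same toy setup, why the hinge loss has no matching normalised likelihood. All function names are illustrative.

```python
import numpy as np

def logistic_likelihood(y, u):
    """P(y|u) = 1 / (1 + exp(-y u)) for labels y in {-1, +1} (tau = 1 assumed)."""
    return 1.0 / (1.0 + np.exp(-y * u))

def log_loss(y, u):
    """Log loss ln(1 + exp(-y u)); logaddexp keeps it numerically stable."""
    return np.logaddexp(0.0, -y * u)

def hinge_loss(y, u):
    """SVM hinge loss max(0, 1 - y u)."""
    return np.maximum(0.0, 1.0 - y * u)

u = np.linspace(-4.0, 4.0, 9)
y = np.ones_like(u)

induced = -2.0 * np.log(logistic_likelihood(y, u))   # -2 ln P(y|u)

# The logistic likelihood sums to one over both labels for every u
normaliser = logistic_likelihood(1.0, u) + logistic_likelihood(-1.0, u)

# exp(-hinge) summed over both labels varies with u, so the hinge loss
# is not the negative log of a likelihood with a u-independent normaliser
hinge_mass = np.exp(-hinge_loss(1.0, u)) + np.exp(-hinge_loss(-1.0, u))
```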

In the following, we discuss several approaches to learn the kernel parameters $\theta$ and
show how all of them can be understood as instances of or approximations to Bayesian
evidence maximisation. Although the expositions of MKL in section 2.1 and MAP in section 2.2
use a linear parametrisation $\theta \mapsto K_\theta = \sum_{m=1}^M \theta_m K_m$, most of the
results on MLM in section 2.3 and all of the discussion above still apply to non-linear
parametrisations.

2.1 Multiple Kernel Learning 

A widely adopted regularised risk principle, known as multiple kernel learning (MKL)
(Christianini et al., 2001; Lanckriet et al., 2004; Bach et al., 2004), is to minimise equation (1)
w.r.t. the kernel parameters $\theta$ as well. One obvious caveat is that for any fixed $u$,
equation (1) becomes ever smaller as $\theta_m \to \infty$: it cannot per se play a meaningful
statistical role. In order to prevent this, researchers constrain the domain of
$\theta \in \Theta$ and obtain

$\min_{\theta \in \Theta} \min_u \; u^\top K_\theta^{-1} u + \ell(y, u)$,

where $\Theta = \{\theta \succeq 0, \|\theta\|_2 \le 1\}$ or
$\Theta = \{\theta \succeq 0, \mathbf{1}^\top \theta \le 1\}$ (Varma and Ray, 2007). Notably,
these constraints are imposed independently of the statistical problem, the model and of the
parametrisation $\theta \mapsto K_\theta$. The Lagrangian form of the MKL problem with parameter
$\lambda$ and a general p-norm unit ball constraint where $p \ge 1$ (Kloft et al., 2009) is given by

$\min_{\theta \succeq 0} \phi_{MKL}(\theta)$, where $\phi_{MKL}(\theta) := \min_u \; u^\top K_\theta^{-1} u + \ell(y, u) + \rho(\theta)$, $\rho(\theta) := \lambda \|\theta\|_p^p = \lambda \cdot \mathbf{1}^\top \theta^p$, $\lambda > 0$. (3)

Since the regulariser $\rho(\theta)$ for the kernel parameter $\theta$ is convex, the map
$(u, K) \mapsto u^\top K^{-1} u$ is jointly convex for $K \succeq 0$ (Boyd and Vandenberghe, 2002)
and the parametrisation $\theta \mapsto K_\theta$ is linear, MKL is a jointly convex problem for
$\theta \succeq 0$ whenever the loss function $\ell(y, u)$ is convex. Furthermore, there are
efficient algorithms to solve equation (3) for large models (Sonnenburg et al., 2006).

[Figure 1, two panels: "$\ln |K_\theta|$ linear upper bound" and "$\ln |K_\theta|$ quadratic upper bound"]

Figure 1: Convex upper bounds on (the concave non-decreasing) $\ln |K_\theta|$.
By Fenchel duality, we can represent any concave non-decreasing function and hence the log
determinant function by $\ln |K_\theta| = \min_{\lambda \succeq 0} \lambda^\top |\theta|^p - g^*(\lambda)$.
As a consequence, we obtain a piecewise polynomial upper bound for any particular value $\lambda$.
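The caveat about unconstrained minimisation over $\theta$ can be illustrated with a small numerical sketch. It assumes two randomly generated positive definite base kernels, $p = 1$ and $\lambda = 1$; the helper names `combine` and `smoothness` are hypothetical.

```python
import numpy as np

def combine(theta, base_kernels, jitter=1e-8):
    """Linear parametrisation K_theta = sum_m theta_m K_m (plus a small jitter)."""
    K = sum(t * Km for t, Km in zip(theta, base_kernels))
    return K + jitter * np.eye(K.shape[0])

def smoothness(theta, base_kernels, u):
    """Regulariser u^T K_theta^{-1} u for a fixed latent vector u."""
    return float(u @ np.linalg.solve(combine(theta, base_kernels), u))

rng = np.random.default_rng(1)
n = 15
A1, A2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
base = [A1 @ A1.T / n + np.eye(n), A2 @ A2.T / n + np.eye(n)]
u = rng.standard_normal(n)
theta = np.array([0.5, 0.5])
lam = 1.0

# Scaling theta up shrinks u^T K^{-1} u towards zero ...
scales = [1.0, 10.0, 100.0]
unregularised = [smoothness(s * theta, base, u) for s in scales]

# ... while the additional term lam * 1^T theta (p = 1) penalises large theta
regularised = [smoothness(s * theta, base, u) + lam * np.sum(s * theta)
               for s in scales]
```

For fixed $u$ the data-fit term is unaffected by $\theta$, so only the two terms shown here decide how large $\theta$ may grow.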

2.2 Joint MAP Estimation 

Adopting a probabilistic MAP viewpoint, we can minimise equation (2) w.r.t. $u$ and
$\theta \succeq 0$:

$\min_{\theta \succeq 0} \phi_{MAP}(\theta)$, where $\phi_{MAP}(\theta) := \min_u \; u^\top K_\theta^{-1} u - 2 \ln P(y | u) + \ln |K_\theta|$. (4)

While equation (3) and equation (4) share the "inner solution" $\hat{u}$ for fixed $K_\theta$
(in case the loss $\ell(y, u)$ corresponds to a likelihood $P(y | u)$), they are different when
it comes to optimising $\theta$. The joint MAP problem is not in general jointly convex in
$(\theta, u)$, since $\theta \mapsto \ln |K_\theta|$ is concave, see figure 2. However, it is
always a well-posed statistical procedure, since $\ln |K_\theta| \to \infty$ as
$\theta_m \to \infty$ for all $m$.

We show in the following how the regularisers $\rho(\theta) = \lambda \|\theta\|_p^p$ of
equation (3) can be related to the probabilistic term $f(\theta) = \ln |K_\theta|$. In fact, the
same reasoning can be applied to any concave non-decreasing function.

Since the function $\theta \mapsto f(\theta) = \ln |K_\theta|$, $\theta \succeq 0$ is jointly
concave, we can represent it by $f(\theta) = \min_{\lambda \succeq 0} \lambda^\top \theta - f^*(\lambda)$
where $f^*(\lambda)$ denotes the Fenchel dual of $f(\theta)$. Furthermore, the mapping
$\vartheta \mapsto \ln |\sum_{m=1}^M \sqrt[p]{\vartheta_m} K_m| = f(\sqrt[p]{\vartheta}) = g(\vartheta)$,
$\vartheta \succeq 0$ is jointly concave due to the composition rule (Boyd and Vandenberghe, 2002,
§3.2.4), because $\vartheta \mapsto \sqrt[p]{\vartheta}$ is jointly concave and
$\theta \mapsto f(\theta)$ is non-decreasing in all components $\theta_m$, as all matrices $K_m$
are positive (semi-)definite, which guarantees that the eigenvalues of $K_\theta$ increase as
$\theta_m$ increases. Thus we can, similarly to Zhang (2010), represent $\ln |K_\theta|$ as

$\ln |K_\theta| = f(\theta) = g(\vartheta) = \min_{\lambda \succeq 0} \lambda^\top \vartheta - g^*(\lambda) = \min_{\lambda \succeq 0} \lambda^\top |\theta|^p - g^*(\lambda)$.

Choosing a particular value $\boldsymbol{\lambda} = \lambda \cdot \mathbf{1}$, we obtain the bound
$\ln |K_\theta| \le \lambda \cdot \|\theta\|_p^p - g^*(\lambda \cdot \mathbf{1})$.
Figure 1 illustrates the bounds for $p = 1$ and $p = 2$. The bottom line is that one can
interpret the regularisers $\rho(\theta) = \lambda \|\theta\|_p^p$ in equation (3) as
corresponding to parametrised upper bounds to the $\ln |K_\theta|$ part in equation (4), hence
$\phi_{MKL}(\theta) = \psi_{MAP}(\theta, \boldsymbol{\lambda} = \lambda \cdot \mathbf{1})$,
where $\phi_{MAP}(\theta) = \min_{\lambda \succeq 0} \psi_{MAP}(\theta, \lambda)$. Far from an
ad hoc choice to keep $\theta$ small, the $\ln |K_\theta|$ term embodies the Occam's razor concept
behind MAP estimation: overly large $\theta$ are ruled out, since their explanation of the data
$y$ is extremely unlikely under the prior $P(u; \theta)$. The Occam's razor effect depends
crucially on the proper normalisation of the prior (MacKay, 1992). For example, the weighting
parameter $C$ of a kernel $\tilde{k} = C k$ can be learned by joint MAP: if $C = e^c$, then
equation (4) is convex in $c$ for any fixed $u$. If kernel-regularised estimation, equation (1),
is interpreted as MAP estimation under a GP prior, equation (2), the correct extension to kernel
learning is joint MAP: the MKL criterion, equation (3), lacks prior normalisation, which renders
MAP w.r.t. $\theta$ meaningful in the first place. From a non-probabilistic viewpoint, the
$\ln |K_\theta|$ term comes with a model and data dependent structure at least as complex as the
rest of equation (3).
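For $p = 1$, the bound is simply a tangent plane to the concave function $\ln |K_\theta|$. The following sketch checks this numerically, assuming three random positive definite base kernels; the tightest $\lambda$ at a point $\theta_0$ is taken to be the gradient $[\mathrm{tr}(K_{\theta_0}^{-1} K_m)]_m$, and the function names are illustrative.

```python
import numpy as np

def logdet_K(theta, base_kernels):
    """f(theta) = ln|K_theta| for K_theta = sum_m theta_m K_m."""
    K = sum(t * Km for t, Km in zip(theta, base_kernels))
    return np.linalg.slogdet(K)[1]

def grad_logdet_K(theta, base_kernels):
    """Gradient [tr(K_theta^{-1} K_m)]_m of f at theta."""
    K = sum(t * Km for t, Km in zip(theta, base_kernels))
    return np.array([np.trace(np.linalg.solve(K, Km)) for Km in base_kernels])

rng = np.random.default_rng(2)
n, M = 10, 3
base = []
for _ in range(M):
    A = rng.standard_normal((n, n))
    base.append(A @ A.T / n + 0.1 * np.eye(n))

theta0 = np.array([1.0, 0.5, 2.0])
lam = grad_logdet_K(theta0, base)
f0 = logdet_K(theta0, base)

def linear_bound(theta):
    """Tangent plane lam^T theta - f*(lam), written as f0 + lam^T (theta - theta0)."""
    return f0 + lam @ (theta - theta0)

# The gap between bound and function should be non-negative everywhere
thetas = rng.uniform(0.05, 5.0, size=(200, M))
gaps = np.array([linear_bound(t) - logdet_K(t, base) for t in thetas])
```

Fixing $\boldsymbol{\lambda} = \lambda \cdot \mathbf{1}$ instead of refitting it at each $\theta$ gives a looser but $\theta$-independent bound, which is the MKL regulariser.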

While the MKL objective, equation (3), enjoys the benefit of being convex in the (linear)
kernel parameters $\theta$, this does not hold true for joint MAP estimation, equation (4), in
general. We illustrate the differences in figure 2. The function $\psi_{MAP}(\theta, u)$ is a
building block of the MAP objective
$\phi_{MAP}(\theta) = \min_u [\psi_{MAP}(\theta, u) - 2 \ln P(y | u)]$, where

$\psi_{MAP}(\theta, u) = u^\top K_\theta^{-1} u + \ln |K_\theta| \le \psi_{MKL}(\theta, u) - g^*(\lambda \cdot \mathbf{1})$, $\quad \psi_{MKL}(\theta, u) = u^\top K_\theta^{-1} u + \lambda \|\theta\|_p^p$.

More concretely, $\psi_{MAP}(\theta, u)$ is a sum of a nonnegative, jointly convex function
$\psi_\cup(\theta, u)$ (the quadratic term) that is strictly decreasing in every component
$\theta_m$ and a concave function $\psi_\cap(\theta)$ (the log determinant) that is strictly
increasing in every component $\theta_m$. Neither $\psi_\cup(\theta, u)$ nor $\psi_\cap(\theta)$
alone has a stationary point, due to their monotonicity in $\theta_m$. However, their sum can
have (even multiple) stationary points, as shown in figure 2 on the left. We can show that the map
$K \mapsto u^\top K^{-1} u + \ln |K|$ is invex, i.e. every stationary point $\hat{K}$ is a global
minimum. Using the convexity of $A \mapsto u^\top A u - \ln |A|$ (Boyd and Vandenberghe, 2002) and
the fact that the derivative of $A \mapsto A^{-1}$ for $A \in \mathbb{R}^{n \times n}$,
$A \succ 0$ has full rank $n^2$, we see by Mishra and Giorgi (2008, theorem 2.1) that
$K \mapsto u^\top K^{-1} u + \ln |K|$ is indeed invex.

Often, the MKL objective for the case $p = 1$ is motivated by the fact that the optimal
solution $\theta^*$ is sparse (e.g. Sonnenburg et al., 2006), meaning that many components
$\theta_m$ are zero. Figure 2 illustrates that $\phi_{MAP}(\theta)$ also yields sparse solutions;
in fact it enforces even more sparsity. In MKL, $\phi_{MAP}(\theta)$ is simply relaxed to a convex
objective $\phi_{MKL}(\theta)$ at the expense of having only a single, less sparse solution.



Intuition for the Gaussian Case 

We can gain further intuition about the criteria $\phi_{MKL}$ and $\phi_{MAP}$ by asking which
matrices $K$ minimise them. For simplicity, assume that
$P(y | u) = \mathcal{N}(y | u, \sigma^2 I)$ and $n / C = \sigma^2$, hence
$\ell(y, u) = \frac{1}{\sigma^2} \|y - u\|_2^2$. The inner minimiser $\hat{u}$ for both
$\phi_{MKL}$ and $\phi_{MAP}$ is given by $K_\theta^{-1} \hat{u} = (K_\theta + \sigma^2 I)^{-1} y$.
With $\sigma^2 \to 0$, we find for joint MAP that $\partial \phi_{MAP} / \partial K = 0$ results in
$\hat{K} = y y^\top$. While this "nonparametric" estimate requires smoothing to be useful in
practice, closeness to $y y^\top$ is fundamental to covariance estimation and can be found in
regularised risk kernel learning work (Christianini et al., 2001). On the other hand, for
$\mathrm{tr}(K_m) = 1$ and hence $\rho(\theta) = \lambda \, \mathrm{tr}(K_\theta) = \lambda \|\theta\|_1$,
$\partial \phi_{MKL} / \partial K = 0$ leads to $\hat{K}^2 = \lambda^{-1} y y^\top$: an odd way of
estimating covariance, not supported by any statistical literature we are aware of.
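The two stationarity conditions can be spelled out in a few lines. This is a sketch under the assumptions stated in the paragraph ($\sigma^2 \to 0$, so that $\hat{u} \to y$ and the inner minimum becomes $y^\top K^{-1} y$), treating $K$ as a free symmetric matrix and using the standard derivatives of $y^\top K^{-1} y$, $\ln |K|$ and $\mathrm{tr}(K)$:

```latex
\begin{align*}
\frac{\partial}{\partial K}\left[ y^\top K^{-1} y + \ln|K| \right]
  &= -K^{-1} y y^\top K^{-1} + K^{-1} = 0
  &&\Longrightarrow\quad \hat{K} = y y^\top, \\
\frac{\partial}{\partial K}\left[ y^\top K^{-1} y + \lambda\,\mathrm{tr}(K) \right]
  &= -K^{-1} y y^\top K^{-1} + \lambda I = 0
  &&\Longrightarrow\quad \hat{K}^2 = \lambda^{-1} y y^\top .
\end{align*}
```

Both implications follow by multiplying the condition with $K$ from the left and from the right. Since $y y^\top$ has rank one, the first solution should be read as a limiting statement.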
Figure 2: Convex and non-convex building blocks of the MKL and MAP objective function



2.3 Marginal Likelihood Maximisation 

While the joint MAP criterion uses a properly normalised prior distribution, it is still not
probabilistically consistent. Kernel learning amounts to finding a value $\hat{\theta}$ of high
data likelihood, no matter what the latent function $u(\cdot)$ is. The correct likelihood to be
maximised is marginal: $P(y | \theta) = \int P(y | u) P(u | \theta) \, du$ ("max-sum"), while joint
MAP employs the plug-in surrogate $\max_u P(y | u) P(u | \theta)$ ("max-max"). Marginal likelihood
maximisation (MLM) is also known as Bayesian estimation, and it underlies the EM algorithm or
maximum likelihood learning of conditional random fields just as well: complexity is controlled
(and overfitting avoided) by averaging over unobserved variables $u$ (MacKay, 1992), rather than
plugging in some point estimate $\hat{u}$:

$\phi_{MLM}(\theta) := -2 \ln \int \mathcal{N}(u | 0, K_\theta) P(y | u) \, du$. (5)

The Gaussian Case 

Before developing a general MLM approximation, we note an important analytically solvable
exception: for Gaussian likelihood $P(y | u) = \mathcal{N}(y | u, \sigma^2 I)$,
$P(y | \theta) = \mathcal{N}(y | 0, K_\theta + \sigma^2 I)$, and MLM becomes

$\phi_{GAU}(\theta) := y^\top (K_\theta + \sigma^2 I)^{-1} y + \ln |K_\theta + \sigma^2 I|$. (6)

Even if the primary purpose is classification, the Gaussian likelihood is used for its analytical
simplicity (Kapoor et al., 2009). Only for the Gaussian case do joint MAP and MLM have an
analytically closed form. From the product formula of Gaussians (Brookes, 2005, §5.1)

$Q(u) := \mathcal{N}(u | 0, K_\theta) \, \mathcal{N}(y | u, \Gamma) = \mathcal{N}(y | 0, K_\theta + \Gamma) \, \mathcal{N}(u | m, V)$,
where $V = (K_\theta^{-1} + \Gamma^{-1})^{-1}$ and $m = V \Gamma^{-1} y$,

we can deduce that

$-2 \ln \int Q(u) \, du = \ln |K_\theta^{-1} + \Gamma^{-1}| + \min_u [-2 \ln Q(u)] - n \ln(2\pi)$. (7)

Using $\sigma^2 I = \Gamma$ and $\min_u [-2 \ln Q(u)] = -2 \ln Q(m)$, we see that by

$\phi_{MAP/GAU}(\theta) := \phi_{GAU}(\theta) - \ln |K_\theta^{-1} + \sigma^{-2} I| \doteq y^\top (K_\theta + \sigma^2 I)^{-1} y + \ln |K_\theta|$ (8)

(where $\doteq$ denotes equality up to an additive constant) MLM and MAP are very similar for the
Gaussian case.

The "ridge regression" approximation is also used together with p-norm constraints
instead of the $\ln |K_\theta|$ term (Cortes et al., 2009)

$\phi_{RR}(\theta) := y^\top (K_\theta + \sigma^2 I)^{-1} y + \lambda \|\theta\|_p^p$. (9)
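As a hedged numerical sketch, the Gaussian criteria can be evaluated with a Cholesky factorisation, and the additive constant hidden in equation (8) can be made explicit: by the determinant identity $|K^{-1} + \sigma^{-2} I| = |K + \sigma^2 I| \, / \, (|K| \, \sigma^{2n})$ it equals $n \ln \sigma^2$. The random test matrix and the function names are assumptions for illustration.

```python
import numpy as np

def phi_gau(K, y, sigma2):
    """Equation (6): y^T (K + s2 I)^{-1} y + ln|K + s2 I| via Cholesky."""
    n = y.size
    L = np.linalg.cholesky(K + sigma2 * np.eye(n))
    w = np.linalg.solve(L, y)            # L w = y, so w^T w = y^T C^{-1} y
    return float(w @ w + 2.0 * np.sum(np.log(np.diag(L))))

def phi_map_gau(K, y, sigma2):
    """Right-hand side of equation (8): y^T (K + s2 I)^{-1} y + ln|K|."""
    n = y.size
    quad = y @ np.linalg.solve(K + sigma2 * np.eye(n), y)
    return float(quad + np.linalg.slogdet(K)[1])

rng = np.random.default_rng(3)
n, sigma2 = 12, 0.3
A = rng.standard_normal((n, n))
K = A @ A.T / n + 0.5 * np.eye(n)
y = rng.standard_normal(n)

# Left-hand side of (8): phi_GAU - ln|K^{-1} + I / s2|
lhs = phi_gau(K, y, sigma2) - np.linalg.slogdet(
    np.linalg.inv(K) + np.eye(n) / sigma2)[1]

# Difference between both sides of (8); it should equal n * ln(sigma2)
constant = lhs - phi_map_gau(K, y, sigma2)
```

The constant does not depend on $\theta$, so it does not affect the minimiser.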

Unfortunately, most GP methods to date work with a Gaussian likelihood for simplicity, a 
restriction which often proves short-sighted. Gaussian-linear models come with unrealistic 
properties, and benefits of MLM over joint MAP cannot be realised. 

Kernel parameter learning has been an integral part of probabilistic GP methods from 
the very beginning. Williams and Rasmussen (1996) proposed MLM for Gaussian noise,
equation (6), fifteen years ago. They treated sums of exponential and linear kernels as well
as learning lengthscales (ARD), predating recent proposals such as "products of kernels" 
(Varma and Babu, 2009). 

The General Case 

In general, joint MAP always has the analytical form of equation (4), while $P(y | \theta)$ can
only be approximated. For non-Gaussian $P(y | u)$, numerous approximate inference methods have
been proposed, specifically motivated by learning kernel parameters via MLM. The simplest
such method is Laplace's approximation, applied to GP binary and multi-way classification
by Williams and Barber (1998): starting with convex joint MAP, $\ln P(y, u)$ is expanded to
second order around the posterior mode $\hat{u}$. More recent approximations (Girolami and Rogers,
2005; Girolami and Zhong, 2006) can be much more accurate, yet come with non-convex
problems and less robust algorithms (Nickisch and Rasmussen, 2008). In this paper, we
concentrate on the variational lower bound relaxation (VB) by Jaakkola and Jordan (2000),
which is convex for log-concave likelihoods $P(y | u)$ (Nickisch and Seeger, 2009), providing a
novel, simple and efficient algorithm. While our VB approximation to MLM is more expensive
to run than joint MAP for non-Gaussian likelihoods (even using Laplace's approximation),
the implementation complexity of our VB algorithm is comparable to what is required in
the Gaussian noise case, equation (6).

More specifically, we exploit that super-Gaussian likelihoods $P(y_i | u_i)$ can be lower
bounded by scaled Gaussians $\mathcal{N}_{\gamma_i}$ of any width $\gamma_i$:

$P(y_i | u_i) = \max_{\gamma_i > 0} \mathcal{N}_{\gamma_i} = \max_{\gamma_i > 0} \exp\left( \beta_i u_i - \frac{u_i^2}{2 \gamma_i} - \frac{1}{2} h_i(\gamma_i) \right)$,

where $\beta_i \propto y_i$ are constants, and $h_i(\cdot)$ is convex (Nickisch and Seeger, 2009)
whenever the likelihood $P(y_i | u_i)$ is log-concave. If the posterior distribution is
$P(u | y) = Z^{-1} P(y | u) P(u)$, then $Z \ge C e^{-\psi_{VB}(\theta, \gamma) / 2}$ by plugging in
these bounds, where $C$ is a constant and

$\phi_{VB}(\theta) := \min_{\gamma \succeq 0} \psi_{VB}(\theta, \gamma)$, $\quad \psi_{VB}(\theta, \gamma) := h(\gamma) - 2 \ln \int \mathcal{N}(u | 0, K_\theta) \, e^{\beta^\top u - \frac{1}{2} u^\top \Gamma^{-1} u} \, du$, (10)

$h(\gamma) := \sum_i h_i(\gamma_i)$, $\Gamma := \mathrm{dg}(\gamma)$. The variational relaxation¹
amounts to maximising the lower bound, which means that $P(u | y)$ is fitted by the Gaussian
approximation $Q(u | y; \gamma)$ with covariance matrix $V = (K_\theta^{-1} + \Gamma^{-1})^{-1}$
(Nickisch and Seeger, 2009). Alternatively, we can interpret $\psi_{VB}(\theta, \gamma)$ as an
upper bound on the Kullback-Leibler divergence $KL(Q(u | y; \gamma) \,\|\, P(u | y))$
(Nickisch, 2010, §2.5.9), a measure for the dissimilarity between the exact posterior $P(u | y)$
and the parametrised Gaussian approximation $Q(u | y; \gamma)$.

1. Generalisations to other super-Gaussian potentials (log-concave or not) or models including
linear couplings and mixed potentials are given by Nickisch and Seeger (2009).
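A well-known instance of such a scaled-Gaussian lower bound is the Jaakkola and Jordan bound on the logistic function. The sketch below writes it in its usual parametrisation by $\xi$ rather than by the width $\gamma$ (the two are related by $\gamma = 2\xi / \tanh(\xi/2)$); with $x = y_i u_i$ the linear coefficient corresponds to $\beta_i = y_i / 2$. The grid and the chosen values of $\xi$ are arbitrary.

```python
import numpy as np

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def jj_lower_bound(x, xi):
    """Jaakkola-Jordan bound, an unnormalised Gaussian in x below sigmoid(x):

    sigmoid(x) >= sigmoid(xi) * exp((x - xi)/2 - lam(xi) * (x^2 - xi^2)),
    with lam(xi) = tanh(xi/2) / (4 xi).
    """
    lam = np.tanh(xi / 2.0) / (4.0 * xi)
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam * (x ** 2 - xi ** 2))

x = np.linspace(-6.0, 6.0, 241)
xis = [0.5, 1.0, 2.0, 4.0]
bounds = np.array([jj_lower_bound(x, xi) for xi in xis])

# Maximising over the variational parameter (xi = |x|) recovers the likelihood
tightest = np.array([jj_lower_bound(xv, max(abs(xv), 1e-6)) for xv in x])
```

Each fixed $\xi$ gives a bound that touches the logistic function at $x = \pm \xi$ and lies below it elsewhere.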

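Finally, note that by equation (7), $\psi_{VB}(\theta, \gamma)$ can also be written as

$\psi_{VB}(\theta, \gamma) = \ln |K_\theta^{-1} + \Gamma^{-1}| + h(\gamma) + \min_u R(u, \theta, \gamma) + \ln |K_\theta|$, (11)

where $R(u, \theta, \gamma) = u^\top (K_\theta^{-1} + \Gamma^{-1}) u - 2 \beta^\top u$. Using the
concavity of $\gamma^{-1} \mapsto \ln |K_\theta^{-1} + \Gamma^{-1}|$ and Fenchel duality
$\ln |K_\theta^{-1} + \Gamma^{-1}| = \min_{z \succeq 0} z^\top \gamma^{-1} - g_\theta^*(z) = \hat{z}_\theta^\top \gamma^{-1} - g_\theta^*(\hat{z}_\theta)$,
with the optimal value $\hat{z}_\theta = \mathrm{dg}(V)$, we can reformulate
$\psi_{VB}(\theta, \gamma)$ as

$\psi_{VB}(\theta, \gamma) = \min_{z \succeq 0} [z^\top \gamma^{-1} - g_\theta^*(z)] + h(\gamma) + \min_u R(u, \theta, \gamma) + \ln |K_\theta|$,

which allows us to perform the minimisation w.r.t. $\gamma$ in closed form (Nickisch, 2010, §3.5.6):

$\phi_{VB}(\theta) = \min_{z \succeq 0} \psi_{VB}(\theta, z)$, $\quad \psi_{VB}(\theta, z) = \min_u \; u^\top K_\theta^{-1} u + \ell_z(y, u) - g_\theta^*(z) + \ln |K_\theta|$, (12)

where $\ell_z(y, u) := 2 \beta^\top (v - u) - 2 \ln P(y | v)$ and finally
$v = \mathrm{sign}(u) \odot \sqrt{u^2 + z}$. Note that for $z = 0$, we exactly recover joint MAP
estimation, equation (4), as $z = 0$ implies $u = v$ and $\ell_z(y, u) = \ell(y, u)$. For fixed
$\theta$, the optimal value $\hat{z}_\theta = \mathrm{dg}(V)$ corresponds to the marginal
variances of the Gaussian approximation $Q(u | y; \gamma)$: variational inference corresponds to
variance-smoothed joint MAP estimation (Nickisch, 2010) with a loss function
$\tilde{\ell}(y, u, \theta)$ that explicitly depends on the kernel parameters $\theta$. We have
two equivalent representations of the loss $\tilde{\ell}(y, u, \theta)$ that directly follow from
equations (11) and (12):

$\tilde{\ell}(y, u, \theta) = \min_{\gamma \succeq 0} [\ln |K_\theta^{-1} + \Gamma^{-1}| + h(\gamma) + u^\top \Gamma^{-1} u - 2 \beta^\top u]$, and

$\tilde{\ell}(y, u, \theta) = \min_{z \succeq 0} [2 \beta^\top (v - u) - 2 \ln P(y | v) - g_\theta^*(z)]$, $\quad v = \mathrm{sign}(u) \odot \sqrt{u^2 + z}$.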
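The variance-smoothed loss $\ell_z$ is easy to evaluate once $\beta$ and the likelihood are fixed. The following sketch assumes a logistic likelihood $P(y_i | u_i) = 1 / (1 + \exp(-y_i u_i))$, for which $\beta_i = y_i / 2$; the random inputs and function names are illustrative only.

```python
import numpy as np

def neg2_log_lik(y, u):
    """-2 ln P(y|u) for the logistic likelihood P(y|u) = sigmoid(y u)."""
    return 2.0 * np.sum(np.logaddexp(0.0, -y * u))

def smoothed_loss(y, u, z, beta):
    """l_z(y, u) = 2 beta^T (v - u) - 2 ln P(y|v), v = sign(u) sqrt(u^2 + z)."""
    v = np.sign(u) * np.sqrt(u ** 2 + z)
    return 2.0 * beta @ (v - u) + neg2_log_lik(y, v)

rng = np.random.default_rng(4)
n = 8
y = rng.choice([-1.0, 1.0], size=n)
u = rng.standard_normal(n)
beta = y / 2.0          # assumption: logistic likelihood, so beta_i = y_i / 2

loss_map = neg2_log_lik(y, u)
loss_z0 = smoothed_loss(y, u, np.zeros(n), beta)      # z = 0: joint MAP loss
loss_z1 = smoothed_loss(y, u, np.full(n, 0.5), beta)  # z > 0: variance-smoothed
```

With $z = 0$ the smoothed loss coincides with the joint MAP loss, and positive marginal variances $z$ can only increase it.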

Our VB problem is $\min_{\theta \succeq 0, \gamma \succeq 0} \psi_{VB}(\theta, \gamma)$ or
equivalently $\min_{\theta \succeq 0, z \succeq 0} \psi_{VB}(\theta, z)$. The inner variables here
are $\gamma$ and $z$, in addition to $u$ in joint MAP. There are further similarities: since
$\psi_{VB}(\theta, \gamma) = -2 \ln \int e^{-\frac{1}{2} R(u, \gamma, \theta)} \, du + h(\gamma) + \ln |2 \pi K_\theta|$,
the map $(\gamma, \theta) \mapsto \psi_{VB} - \ln |K_\theta|$ is jointly convex for
$\gamma \succeq 0$, $\theta \succeq 0$, by the joint convexity of $(u, \gamma, \theta) \mapsto R$
and Prékopa's theorem (Boyd and Vandenberghe, 2002, §3.5.2). Joint MAP and VB share the same
convexity structure. In contrast, approximating $P(y | \theta)$ by other techniques like
Expectation Propagation (Minka, 2001) or general Variational Bayes (Opper and Archambeau, 2009)
does not even constitute convex problems for fixed $\theta$.

2.4 Summary and Taxonomy 

In the last paragraphs, we have detailed how a variety of kernel learning approaches can
be obtained from Bayesian marginal likelihood maximisation in a sequence of nested upper
bounding steps. Table 2 illustrates how many kernel learning objectives are related
to each other, either by upper bounds or by Gaussianity assumptions. We can clearly see
that $\phi_{VB}(\theta)$, as an upper bound to the negative log marginal likelihood, can be seen
as the mother function. For a special case, $z = 0$, we obtain joint maximum a posteriori
estimation, where the loss function does not depend on the kernel parameters. Going
further, a particular instance $\boldsymbol{\lambda} = \lambda \cdot \mathbf{1}$ yields the widely
used multiple kernel learning objective that becomes convex in the kernel parameters $\theta$.
In the following, we will concentrate on the optimisation and computational similarities between
the approaches.

Name                             | Objective function
---------------------------------+--------------------------------------------------------------
Marginal Likelihood Maximisation | $\phi_{MLM}(\theta) = -2 \ln \int \mathcal{N}(u \mid 0, K_\theta) P(y \mid u) \, du$
Variational Bounds               | $\phi_{VB}(\theta) = \min_{\gamma \succeq 0} \psi_{VB}(\theta, \gamma) \ge \phi_{MLM}(\theta)$ by $P(y_i \mid u_i) \ge \mathcal{N}_{\gamma_i}$
Maximum A Posteriori             | $\phi_{MAP}(\theta) = -2 \ln [\max_u \mathcal{N}(u \mid 0, K_\theta) P(y \mid u)] = \psi_{VB}(\theta, z = 0)$
Multiple Kernel Learning         | $\phi_{MKL}(\theta) = \phi_{MAP}(\theta) + \lambda \|\theta\|_p^p - \ln \lvert K_\theta \rvert = \psi_{MAP}(\theta, \boldsymbol{\lambda} = \lambda \cdot \mathbf{1})$

Step (top to bottom)                                                     | General $P(y_i \mid u_i)$        | Gaussian $P(y_i \mid u_i)$           | Gaussian case
-------------------------------------------------------------------------+----------------------------------+--------------------------------------+---------------
(starting point)                                                         | $\phi_{MLM}(\theta)$, eq. (5)    | $\phi_{GAU}(\theta)$, eq. (6)        |
Super-Gaussian bounding                                                  | $\phi_{VB}(\theta)$, eq. (10)    | $\phi_{GAU}(\theta)$, eq. (6)        | Bound is tight
Maximum instead of integral                                              | $\phi_{MAP}(\theta)$, eq. (4)    | $\phi_{MAP/GAU}(\theta)$, eq. (8)    | Mode = mean
Bound $\ln \lvert K_\theta \rvert \le \lambda \|\theta\|_p^p - g^*(\lambda \mathbf{1})$ | $\phi_{MKL}(\theta)$, eq. (3) | $\phi_{RR}(\theta)$, eq. (9) |

Table 2: Taxonomy of kernel learning objective functions.
The upper table visualises the relationship between several kernel learning objective
functions for arbitrary likelihood/loss functions: marginal likelihood maximisation (MLM) can
be bounded by variational bounds (VB), and maximum a posteriori estimation (MAP) is the
special case $z = 0$ thereof. Finally, multiple kernel learning (MKL) can be understood as an
upper bound to the MAP estimation objective with $\boldsymbol{\lambda} = \lambda \cdot \mathbf{1}$.
The lower table complements the upper table by also covering the analytically important
Gaussian case.



3. Algorithms 

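In this section, we derive a simple, provably convergent and efficient algorithm for MKL,
joint MAP and VB. We use the Lagrangian form of equation (3) and
$\ell(y, u) := -2 \ln P(y | u)$:

$\phi_{MKL}(\theta, u) = u^\top K_\theta^{-1} u + \ell(y, u) + \lambda \cdot \mathbf{1}^\top \theta$, $\lambda > 0$,
$\phi_{MAP}(\theta, u) = u^\top K_\theta^{-1} u + \ell(y, u) + \ln |K_\theta|$, and
$\phi_{VB}(\theta, u) = u^\top K_\theta^{-1} u + \min_{z \succeq 0} [2 \beta^\top (v - u) - 2 \ln P(y | v) - g_\theta^*(z)] + \ln |K_\theta|$,

where $v = \mathrm{sign}(u) \odot \sqrt{u^2 + z}$.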



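Many previous algorithms use alternating minimisation, which is easy to implement but
tends to converge slowly. Both $\phi_{VB}$ and $\phi_{MAP}$ are jointly convex up to the concave
$\theta \mapsto \ln |K_\theta|$ part. Since
$\ln |K_\theta| = \min_{\lambda \succeq 0} \lambda^\top \theta - f^*(\lambda)$ (Legendre duality,
Boyd and Vandenberghe, 2002), joint MAP becomes
$\min_{\lambda \succeq 0, \theta \succeq 0, u} \phi_\lambda(\theta, u)$ with
$\phi_\lambda := u^\top K_\theta^{-1} u + \ell(y, u) + \lambda^\top \theta - f^*(\lambda)$, which is
jointly convex in $(\theta, u)$. Algorithm 1 iterates between refits of $\lambda$ and joint Newton
updates of $(\theta, u)$.

Algorithm 1 Double loop algorithm for joint MAP, MKL and VB.

Require: Criterion $\phi_\#(\theta, u) = \tilde{\psi}_\#(\theta, u) + \ln |K_\theta|$ to minimise for $(u, \theta) \in \mathbb{R}^n \times \mathbb{R}^M_+$.
repeat
    Newton $\min_u \phi_\#$ for fixed $\theta$ (optional; few steps).
    Refit upper bound: $\lambda \leftarrow \nabla_\theta \ln |K_\theta| = [\mathrm{tr}(K_\theta^{-1} K_1), .., \mathrm{tr}(K_\theta^{-1} K_M)]^\top$.
    Compute joint Newton search direction $d$ for $\phi_\lambda := \tilde{\psi}_\# + \lambda^\top \theta$: $\nabla^2_{[\theta; u]} \phi_\lambda \, d = -\nabla_{[\theta; u]} \phi_\lambda$.
    Linesearch: Minimise $\phi_\#(\alpha)$, i.e. $\phi_\#(\theta, u)$ along $[\theta; u] + \alpha d$, $\alpha > 0$.
until Outer loop converged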
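The bound-refit idea can be sketched in a few lines for the Gaussian criterion of equation (6), where $u$ is eliminated analytically. This is not Algorithm 1 verbatim: the concave part is taken to be $\ln |K_\theta + \sigma^2 I|$ instead of $\ln |K_\theta|$, positivity of $\theta$ is kept by shrinking the step rather than by a barrier, and a small damping term is added to the Hessian. The data, kernels and all names are assumptions made for illustration.

```python
import numpy as np

def K_of(theta, base):
    """Linear parametrisation K_theta = sum_m theta_m K_m."""
    return sum(t * Km for t, Km in zip(theta, base))

def objective(theta, base, y, sigma2):
    """Gaussian criterion (6): y^T C^{-1} y + ln|C| with C = K_theta + s2 I."""
    C = K_of(theta, base) + sigma2 * np.eye(y.size)
    return float(y @ np.linalg.solve(C, y) + np.linalg.slogdet(C)[1])

def double_loop(theta, base, y, sigma2, iters=30, damping=1e-6):
    history = [objective(theta, base, y, sigma2)]
    for _ in range(iters):
        C = K_of(theta, base) + sigma2 * np.eye(y.size)
        a = np.linalg.solve(C, y)
        # Refit the upper bound: lambda = grad ln|C| = [tr(C^{-1} K_m)]_m
        lam = np.array([np.trace(np.linalg.solve(C, Km)) for Km in base])
        # Gradient and Hessian of the convex part y^T C^{-1} y w.r.t. theta
        B = np.stack([Km @ a for Km in base], axis=1)         # n x M
        grad = -np.array([a @ Km @ a for Km in base]) + lam   # surrogate gradient
        hess = 2.0 * B.T @ np.linalg.solve(C, B)
        d = -np.linalg.solve(hess + damping * np.eye(len(theta)), grad)
        # Backtracking line search on the true objective, keeping theta > 0
        alpha, f0 = 1.0, history[-1]
        while alpha > 1e-12:
            cand = theta + alpha * d
            if np.all(cand > 1e-6):
                f = objective(cand, base, y, sigma2)
                if f <= f0 + 1e-4 * alpha * (grad @ d):
                    theta, f0 = cand, f
                    break
            alpha *= 0.5
        history.append(f0)
    return theta, history

rng = np.random.default_rng(5)
n = 20
x = np.linspace(0.0, 4.0, n)
d2 = (x[:, None] - x[None, :]) ** 2
base = [np.exp(-0.5 * d2 / ell ** 2) + 1e-3 * np.eye(n) for ell in (0.3, 1.0, 3.0)]
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(n)
theta_hat, history = double_loop(np.ones(3), base, y, sigma2=0.1)
```

Because the surrogate and the true objective share their gradient at the current $\theta$, the damped Newton direction of the surrogate is a descent direction for the true objective, so the line search yields a non-increasing sequence of objective values.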



The Newton direction costs $O(n^3 + M n^2)$, with $n$ the number of data points and $M$ the
number of base kernels. All algorithms discussed in this paper require $O(n^3)$ time, apart
from the requirement to store the base matrices $K_m$. The convergence proof hinges on the
fact that $\phi$ and $\phi_\lambda$ are tangentially equal (Nickisch and Seeger, 2009).
Equivalently, the algorithm can be understood as Newton's method, yet dropping the part of the
Hessian corresponding to the $\ln |K|$ term (note that
$\nabla_{(u, \theta)} \phi_\lambda = \nabla_{(u, \theta)} \phi$ for the Newton direction
computation). For MKL, whose objective has no $\ln |K|$ term, this is exact Newton.

In practice, we use $K_\theta = \sum_m \theta_m K_m + \epsilon I$, $\epsilon = 10^{-8}$, to avoid
numerical problems when computing $\lambda$ and $\ln |K_\theta|$. We also have to enforce
$\theta \succeq 0$ in algorithm 1, which is done by the barrier method (Boyd and Vandenberghe,
2002). We minimise $t \phi - \mathbf{1}^\top (\ln \theta)$ instead of $\phi$, increasing $t > 0$
every few outer loop iterations.

A variant of algorithm 1 can be used to solve VB in a different parametrisation
($\gamma \succeq 0$ replaces $u$), which has the same convexity structure as joint MAP.
Transforming equation (10) similarly to equation (6), we obtain

$\phi_{VB}(\theta) = \min_{\gamma \succeq 0} \ln |C| - \ln |\Gamma| + \beta^\top \Gamma C^{-1} \Gamma \beta - \beta^\top \Gamma \beta + h(\gamma)$ (13)

with $C := K_\theta + \Gamma$, computed using the Cholesky factorisation $C = L L^\top$. The
required derivatives cost $O(M n^3)$ to compute, which is more expensive than for joint MAP or
MKL. Note that the cost $O(M n^3)$ is not specific to our particular relaxation or algorithm:
e.g. the Laplace MLM approximation (Williams and Barber, 1998), solved using gradients w.r.t.
$\theta$ only, comes with the same complexity.
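Equation (13) can be cross-checked against the Gaussian integral of equation (10). The sketch below evaluates $-2 \ln \int \mathcal{N}(u | 0, K) \, e^{\beta^\top u - \frac{1}{2} u^\top \Gamma^{-1} u} \, du = \ln |K| - \ln |V| - \beta^\top V \beta$ directly and compares it with the Cholesky-based form of (13), leaving out the common term $h(\gamma)$. The random $K$, $\gamma$ and $\beta$ are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 9
A = rng.standard_normal((n, n))
K = A @ A.T / n + 0.5 * np.eye(n)
gamma = rng.uniform(0.5, 2.0, size=n)
beta = rng.standard_normal(n)
Gamma = np.diag(gamma)

def logdet(M):
    """Log determinant of a positive definite matrix."""
    return np.linalg.slogdet(M)[1]

# Direct evaluation with V = (K^{-1} + Gamma^{-1})^{-1}
V = np.linalg.inv(np.linalg.inv(K) + np.diag(1.0 / gamma))
direct = logdet(K) - logdet(V) - beta @ V @ beta

# Equation (13) without h(gamma), using the Cholesky factor of C = K + Gamma
C = K + Gamma
L = np.linalg.cholesky(C)
w = np.linalg.solve(L, gamma * beta)              # L w = Gamma beta
eq13 = (2.0 * np.sum(np.log(np.diag(L))) - np.sum(np.log(gamma))
        + w @ w - beta @ (gamma * beta))
```

The agreement rests on $|K| \, |K^{-1} + \Gamma^{-1}| = |C| / |\Gamma|$ and on the identity $V = \Gamma - \Gamma C^{-1} \Gamma$; form (13) avoids inverting $K_\theta$ explicitly.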

4. Conclusion 

We presented a unifying probabilistic viewpoint to multiple kernel learning that derives 
regularised risk approaches as special cases of approximate Bayesian inference. We provided 
an efficient and provably convergent optimisation algorithm suitable for regression, robust 
regression and classification. 

Our taxonomy of multiple kernel learning approaches connected many previously only 
loosely related ideas and provided insights into the common structure of the respective 
optimisation problems. Finally, we proposed an algorithm to solve the latter efficiently. 